
7 research outputs found

A low-latency, big database system and browser for storage, querying and visualization of 3D genomic data

Author: Blanchette Mathieu
Butyaev Alexander
Cudré-Mauroux Philippe
Mavlyutov Ruslan
Waldispühl Jérôme
Publication date: 02/08/2017

Recent releases of genome three-dimensional (3D) structures have the potential to transform our understanding of genomes. Nonetheless, storage technology and visualization tools need to evolve to offer the scientific community fast and convenient access to these data. We simultaneously introduce a database system to store and query 3D genomic data (3DBG), and a 3D genome browser to visualize and explore 3D genome structures (3DGB). We benchmark 3DBG against state-of-the-art systems and demonstrate that it is faster than previous solutions and, importantly, scales gracefully with the size of the data. We also illustrate the usefulness of our 3D genome Web browser to explore human genome structures. The 3D genome browser is available at http://3dgb.cs.mcgill.c

RERO DOC Digital Library

CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders

Author: Chang Heng-Jui
Chung Yu-An
Dong Ning
Mavlyutov Ruslan
Popuri Sravya
Publication date: 14/09/2023

Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech recognition and translation tasks. Due to the high cost of developing these large models, building new encoders for new tasks and deploying them to on-device applications are infeasible. Prior studies propose model compression methods to address this issue, but those works focus on smaller models and less realistic tasks. Thus, we propose Contrastive Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to compress pre-trained speech encoders by leveraging masked prediction and contrastive learning to train student models to copy the behavior of a large teacher model. CoLLD outperforms prior methods and closes the gap between small and large models on multilingual speech-to-text translation and recognition benchmarks. Comment: Submitted to ICASSP 202

arXiv.org e-Print Archive

Efficient document filtering using vector space topic expansion and pattern-mining: the case of event detection in microposts

Author: Aberer Karl
Castillo Carlos
Cudré-Mauroux Philippe
Mavlyutov Ruslan
Proskurnia Julia
Publication date: 04/04/2019

Automatically extracting information from social media is challenging given that social content is often noisy, ambiguous, and inconsistent. However, as many stories break on social channels first before being picked up by mainstream media, developing methods to better handle social content is of utmost importance. In this paper, we propose a robust and effective approach to automatically identify microposts related to a specific topic defined by a small sample of reference documents. Our framework extracts clusters of semantically similar microposts that overlap with the reference documents, by extracting combinations of key features that define those clusters through frequent pattern mining. This allows us to construct compact and interpretable representations of the topic, dramatically decreasing the computational burden compared to classical clustering and k-NN-based machine learning techniques and producing highly competitive results even with small training sets (fewer than 1,000 training objects). Our method is efficient and scales gracefully with large sets of incoming microposts. We experimentally validate our approach on a large corpus of over 60M microposts, showing that it significantly outperforms state-of-the-art techniques.

RERO DOC Digital Library

Analyzing Large-Scale Public Campaigns on Twitter

Author: Aberer Karl
Cudré-Mauroux Philippe
Mavlyutov Ruslan
Prokofyev Roman
Proskurnia Julia
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 07/09/2016

Social media has become an important instrument for running various types of public campaigns and mobilizing people. Yet, the dynamics of public campaigns on social networking platforms still remain largely unexplored. In this paper, we present an in-depth analysis of over one hundred large-scale campaigns on social media platforms covering more than 6 years. In particular, we focus on campaigns related to climate change on Twitter, which promote online activism to encourage, educate, and motivate people to react to the various issues raised by climate change. We propose a generic framework based on crowdsourcing to identify both the type of a given campaign and the various actions undertaken throughout its lifespan: official meetings, physical actions, calls for action, publications on climate-related research, etc. We study whether the type of a campaign is correlated with the actions undertaken and how these actions influence the flow of the campaign. Leveraging more than one hundred different campaigns, we build a model capable of accurately predicting the presence of individual actions in tweets. Finally, we explore the influence of active users on the overall campaign flow.

Infoscience - École polytechnique fédérale de Lausanne

SeamlessM4T-Massively Multilingual & Multimodal Machine Translation

Author: Akula Bapi
Andrews Pierre
Balioglu Can
Barrault Loïc
Celebi Onur
Chen Peng-Jen
Chung Yu-An
Seamless Communication
Costa-jussà Marta R.
Dale David
Dong Ning
Duquenne Paul-Ambroise
Elbayad Maha
Ellis Brian
Elsahar Hady
Gao Cynthia
Gong Hongyu
Gonzalez Gabriel Mejia
Guzmán Francisco
Haaheim Justin
Hachem Naji El
Hansanti Prangthip
Heffernan Kevin
Hoffman John
Howes Russ
Huang Bernie
Hwang Min-Jae
Inaguma Hirofumi
Jain Somya
Kalbassi Elahe
Kallet Amanda
Kao Justine
Klaiber Christopher
Kulikov Ilia
Lam Janice
Lee Ann
Li Daniel
Li Pengwei
Licht Daniel
Ma Xutai
Maillard Jean
Mavlyutov Ruslan
Meglioli Mariano Cora
Mourachko Alexandre
Peloquin Benjamin
Pino Juan
Popuri Sravya
Rakotoarison Alice
Ramadan Mohamed
Ramakrishnan Abinesh
Ropers Christophe
Sadagopan Kaushik Ram
Saleem Safiyyah
Schwenk Holger
Sun Anna
Tomasello Paden
Tran Kevin
Tran Tuan
Tufanov Igor
Vogeti Vish
Wang Changhan
Wang Jeff
Wang Skyler
Wenzek Guillaume
Wood Carleigh
Yang Yilin
Ye Ethan
Yu Bokai
Publication date: 23/08/2023

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communicatio

arXiv.org e-Print Archive

Diffusion Entropy and the Path Dimension of Frictional Finger Patterns

Author: Aberer Karl
Castillo Carlos
Cudré-Mauroux Philippe
Mavlyutov Ruslan
Proskurnia Julia
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2017

The authors investigate, using both analytical and numerical methods, the entropy associated with a diffusion process inside frictional finger patterns. The entropy obtained from simulations of diffusion inside the pattern is compared to analytical predictions based on an effective continuum description. The analytical result predicts that the entropy depends in a particular way on the path dimension of the system, which governs the scaling of simple paths in the system. The findings indicate that there is a close analogy between the frictional fingers in the continuum and minimum spanning trees on the lattice, as the path dimension is found, through studies of the entropy, to be close to the defining value for the minimum spanning tree universality class.

Infoscience - École polytechnique fédérale de Lausanne

Crossref

UPF Digital Repository

NORA - Norwegian Open Research Archives

A low-latency, big database system and browser for storage, querying and visualization of 3D genomic data

Author: Alexander Butyaev
Guttman
Jérôme Waldispühl
Mathieu Blanchette
Philippe Cudré-Mauroux
Ruslan Mavlyutov
Publication venue: 'Oxford University Press (OUP)'

Crossref